Performance Evaluation of an Efficient Frequent Item sets-Based Text Clustering Approach

Author

  • S. Murali Krishna
Abstract

The amount of textual information available in electronic form is growing at a staggering rate. Mining useful or interesting frequent itemsets (words/terms) from the very large text databases that result from this growth remains a challenging task. The use of such frequent itemsets for text clustering has received a great deal of attention in the research community, because the mined frequent itemsets drastically reduce the dimensionality of the documents. An efficient approach to text clustering based on frequent itemsets has been devised. The frequent itemsets are mined with the well-known Apriori algorithm. The documents are then initially partitioned, without overlap, using the mined frequent itemsets. Finally, the documents within each partition are grouped using derived keywords to obtain the resultant clusters. In this paper, we present an extensive analysis of the frequent itemset-based text clustering approach on different real-life datasets and evaluate its performance using precision, recall and F-measure. The experimental results show that the efficiency of the frequent itemset-based text clustering approach is improved significantly across the different real-life datasets.

Keywords: Text mining, Text clustering, Text documents, Frequent itemsets, Apriori, Reuters-21578, WebKB dataset, 20 Newsgroups.
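The two building blocks named in the abstract, Apriori mining of frequent term sets and evaluation by precision, recall and F-measure, can be illustrated with a minimal sketch. This is not the paper's implementation: the function names `frequent_itemsets` and `precision_recall_f`, the whitespace tokenisation, and the use of an absolute support count are all assumptions made for illustration, and the partitioning and keyword-grouping steps are not shown.

```python
from itertools import combinations


def frequent_itemsets(documents, min_support):
    """Apriori sketch: return {term set: document count} for every term set
    that occurs in at least `min_support` documents (hypothetical helper)."""
    # Assumption: each document is a set of lowercase whitespace-separated terms.
    transactions = [frozenset(doc.lower().split()) for doc in documents]

    # Level 1: count individual terms.
    counts = {}
    for transaction in transactions:
        for term in transaction:
            key = frozenset([term])
            counts[key] = counts.get(key, 0) + 1
    current = {s: n for s, n in counts.items() if n >= min_support}
    frequent = dict(current)

    k = 2
    while current:
        # Join step: merge pairs of frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a, b in combinations(list(current), 2)
                      if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current
                             for s in combinations(c, k - 1))}
        # Count support of the surviving candidates.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {s: n for s, n in counts.items() if n >= min_support}
        frequent.update(current)
        k += 1
    return frequent


def precision_recall_f(cluster, true_class):
    """Precision, recall and F-measure of one cluster against one class.
    Both arguments are sets of document ids (hypothetical helper)."""
    overlap = len(cluster & true_class)
    precision = overlap / len(cluster) if cluster else 0.0
    recall = overlap / len(true_class) if true_class else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure
```

For example, calling `frequent_itemsets(["apple banana cherry", "apple banana", "apple cherry", "banana date"], 2)` keeps the single terms apple, banana and cherry and the pairs {apple, banana} and {apple, cherry}; the triple is pruned because {banana, cherry} is not frequent.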


Similar resources

Effective Term Based Text Clustering Algorithms

Text clustering methods can be used to group large sets of text documents. Most text clustering methods do not address the specific problems of text clustering, such as the very high dimensionality of the data and the understandability of the cluster descriptions. In this paper, a frequent term-based approach to clustering is introduced; it provides a natural way of reducing a large dimensionalit...


Performance Improvement for Frequent Term-based Text Clustering Algorithm

Frequent term-based text clustering [2] is a recently introduced text clustering technique that uses frequent term sets and dramatically decreases the dimensionality of the document vector space, thus addressing two central problems of text clustering: the very high dimensionality of the data and the very large size of the databases [2]. Moreover, frequent term sets provide understandabl...


A Clustering Based Location-allocation Problem Considering Transportation Costs and Statistical Properties (RESEARCH NOTE)

Cluster analysis is a useful technique in multivariate statistical analysis. Different types of hierarchical cluster analysis and K-means have been used for data analysis in previous studies. However, the K-means algorithm can be improved using metaheuristic algorithms. In this study, we propose a simulated annealing-based algorithm for K-means in clustering analysis, which we refer it a...


Investigate the Performance of Document Clustering Approach Based on Association Rules Mining

The challenges of standard clustering methods and the weaknesses of the Apriori algorithm in frequent termset clustering motivate the goal of our research. Based on association rule mining, an efficient approach for Web Document Clustering (ARWDC) has been devised. An efficient Multi-Tire Hashing Frequent Termsets algorithm (MTHFT) has been used to improve the efficiency of mining association...


A Framework for Optimal Attribute Evaluation and Selection in Hesitant Fuzzy Environment Based on Enhanced Ordered Weighted Entropy Approach for Medical Dataset

Background: In this paper, a generic hesitant fuzzy set (HFS) model for clustering various ECG beats according to the weights of their attributes is proposed. A comprehensive review of electrocardiogram signal classification and segmentation methodologies indicates that ECG analysis calls for algorithms that can effectively handle the nonstationarity and uncertainty of the signals. Ex...




Publication date: 2010